# Description of Functionality

## Assets & Liabilities Extraction

The assets and liabilities (A&L) tables are extracted in a two-stage process. In the first stage, we parse each HTML filing and build a semantic tree using the package `sec_parser`. We then perform a breadth-first search through this tree to find `TableElement` components whose parent or immediate sibling is a `TitleElement` containing either "assets and liabilities" or "balance sheet". These tables are saved in `output_al_train`. The purpose of this stage is to quickly generate a set of tables that are identified as A&L tables with high accuracy; it is not meant to find every A&L table.
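The first-stage search can be sketched roughly as follows. This is a minimal illustration and not the code in the notebooks: the `Node` class is a hypothetical stand-in for a semantic-tree node (it does not use the `sec_parser` API), and the names `is_al_title` and `find_al_tables` are assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a semantic-tree node (not the sec_parser API)."""
    kind: str                      # e.g. "title", "table", "text"
    text: str = ""
    children: list = field(default_factory=list)


KEYWORDS = ("assets and liabilities", "balance sheet")


def is_al_title(node):
    """True if node is a title whose text contains one of the A&L keywords."""
    return (
        node is not None
        and node.kind == "title"
        and any(k in node.text.lower() for k in KEYWORDS)
    )


def find_al_tables(root):
    """Breadth-first search for tables whose parent or adjacent sibling is an A&L title."""
    found = []
    # Each queue entry carries the node plus its parent and adjacent siblings.
    queue = deque([(root, None, None, None)])
    while queue:
        node, parent, prev_sib, next_sib = queue.popleft()
        if node.kind == "table" and any(
            is_al_title(n) for n in (parent, prev_sib, next_sib)
        ):
            found.append(node)
        kids = node.children
        for i, child in enumerate(kids):
            prev_c = kids[i - 1] if i > 0 else None
            next_c = kids[i + 1] if i + 1 < len(kids) else None
            queue.append((child, node, prev_c, next_c))
    return found
```

Matching only on the parent and adjacent siblings keeps the search strict, which suits a stage whose goal is a small, clean set of tables.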

In the second stage, we change the paradigm and use a simple classifier built from these tables instead of searching for tables by their headings. We exploit the structure of a balance sheet, in which many words are repeated across filings, and use the bag-of-words model to view each table as a distribution of words. Specifically, we count the (cleaned) value in each cell of each table in `output_al_train` and normalize the counts. This gives us a "prototype" A&L sheet distribution. Using this distribution, we score each table in the complete dataset by computing the cosine similarity between that table's word distribution and our prototype distribution. Within a filing, the table with the highest score is likely the A&L sheet. This process is repeated for each filing to generate our complete set.
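The scoring step can be sketched as below, using only the standard library. This is an illustration under stated assumptions rather than the notebook code: a table is represented as a list of rows of cell strings, the cleaning rule in `clean` (lowercase, letters only) is assumed, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def clean(cell):
    """Assumed cleaning rule: lowercase and keep letters only, so numeric cells vanish."""
    return re.sub(r"[^a-z]+", " ", str(cell).lower()).strip()


def distribution(tables):
    """Normalized counts of cleaned cell values over one or more tables."""
    counts = Counter()
    for table in tables:
        for row in table:
            for cell in row:
                token = clean(cell)
                if token:
                    counts[token] += 1
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse distributions stored as dicts."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def best_table(prototype, candidates):
    """Return the candidate table whose distribution is closest to the prototype."""
    return max(candidates, key=lambda t: cosine(prototype, distribution([t])))
```

In this sketch, `distribution` would be called once on the stage-one tables to build the prototype, and `best_table` once per filing on that filing's tables.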

With these DataFrames stored as CSV files, we load each of them into memory and extract the line items "Total Assets" and "Total Liabilities" along with the year. Finally, we can plot these using Matplotlib.

## Schedule of Investments Extraction

Treating tables as word distributions doesn't work well for extracting the schedule of investments, because the line items in these tables, and therefore their word distributions, vary widely. Thus, we take a different approach. Starting with the semantic tree from above, we perform a depth-first search to find tables. Beginning at the root, we search through the tree for `TitleElement` components that contain any of a set of keywords synonymous with "schedule of investments". When one is found, we enter the "title found context" and continue searching the tree. If we encounter a table while in this context, it is presumed to be part of the schedule of investments. Additionally, if any element encountered while in this context contains a pattern such as "thousand", "million", or "billion", we save it as the units. Once a table is found, we exit the "title found context" and continue searching. We do not return at this point, as many filings present the schedule of investments in several tables. Finally, we prune any branch of the tree whose heading contains the year preceding the year of the filing, to avoid picking up the previous year's tables, which are sometimes included in filings.
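The search described above can be sketched as follows. As before, `Node` is a hypothetical stand-in for a semantic-tree node rather than the `sec_parser` API, the keyword list is illustrative, and carrying the most recently seen units forward to later tables is an assumption of this sketch.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a semantic-tree node (not the sec_parser API)."""
    kind: str                      # e.g. "title", "table", "text"
    text: str = ""
    children: list = field(default_factory=list)


# Illustrative keyword list; the actual set of synonyms is not given here.
TITLE_KEYWORDS = ("schedule of investments", "portfolio of investments")
UNIT_PATTERN = re.compile(r"\b(thousand|million|billion)s?\b", re.IGNORECASE)


def find_investment_tables(root, filing_year):
    """Depth-first search returning (table, units) pairs found after a matching title."""
    results = []
    state = {"title_found": False, "units": None}
    prior_year = str(filing_year - 1)

    def visit(node):
        text = node.text.lower()
        if node.kind == "title":
            if prior_year in text:
                return  # prune branches headed by the previous year
            if any(k in text for k in TITLE_KEYWORDS):
                state["title_found"] = True
        if state["title_found"]:
            match = UNIT_PATTERN.search(node.text)
            if match:
                state["units"] = match.group(1).lower()
            if node.kind == "table":
                results.append((node, state["units"]))
                state["title_found"] = False  # exit the context but keep searching
        for child in node.children:
            visit(child)

    visit(root)
    return results
```

Because the function keeps walking after each hit, a schedule that is split across several titled tables yields several results from one pass.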

We then apply the identified units to each table and return the tables. Next, we read all of these tables and convert them into a single DataFrame of investments. To do this, we read each table CSV and identify the columns with relevant headings such as "par value", "fair value", and "cost". Once these are identified, we iterate through each row and eliminate non-investment rows (e.g., group headings or notes) by keyword matching and by verifying that object-dtype columns contain numeric values. With this information, we again generate plots using Matplotlib.
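A rough sketch of the row filtering is shown below in plain Python rather than with DataFrames. Everything specific here is an assumption made for illustration: the skip keywords, the convention that the first column holds the investment name, the `multiplier` argument standing in for the units, and the function names.

```python
VALUE_HEADINGS = ("par value", "fair value", "cost")
SKIP_KEYWORDS = ("total", "see notes")  # illustrative, not the actual keyword list


def to_number(cell):
    """Parse '1,234', '$ 950', or '(12)' as a float; return None if not numeric."""
    text = str(cell).replace("$", "").replace(",", "").strip()
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1].strip()
    try:
        value = float(text)
    except ValueError:
        return None
    return -value if negative else value


def extract_investments(header, rows, multiplier=1):
    """Keep rows that look like investments and scale their values by the units."""
    # Map each relevant heading to the first column whose name contains it.
    cols = {}
    for i, name in enumerate(header):
        for heading in VALUE_HEADINGS:
            if heading in name.lower():
                cols.setdefault(heading, i)

    investments = []
    for row in rows:
        label = row[0].strip()  # assumes the first column is the investment name
        if not label or any(k in label.lower() for k in SKIP_KEYWORDS):
            continue  # notes, totals, blank rows
        values = {h: to_number(row[i]) for h, i in cols.items()}
        if not cols or any(v is None for v in values.values()):
            continue  # group headings have no numeric values
        scaled = {h: v * multiplier for h, v in values.items()}
        investments.append({"name": label, **scaled})
    return investments
```

For a table reported in thousands, calling `extract_investments(header, rows, multiplier=1000)` would return values in base units.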

# Steps to Run Replication Materials

1. In `Create_AL_Small_Set.ipynb`, run every cell from top to bottom.
2. In `Extract_AL_Tables.ipynb`, run every cell from top to bottom.
3. In `Extract_Inv_Tables.ipynb`, run every cell from top to bottom.
4. In `Produce_Output.ipynb`, run every cell from top to bottom.

**Note:** The first three notebooks use multiprocessing, which greatly reduces the runtime. As a side effect, they occasionally process all files but fail to terminate fully. In this case, you can terminate the processing once no more CSVs are being written.
